Requirements

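requirements <- c("tidyverse", "mice", "caTools", "corrplot", "summarytools", "plotly")

# Load each package, installing it first if it is not already available
for (req in requirements) {
  if (!require(req, character.only = TRUE)) {
    install.packages(req)
    # Attach the freshly installed package so it is usable in this session
    library(req, character.only = TRUE)
  }
}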
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Loading required package: mice
## 
## 
## Attaching package: 'mice'
## 
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## 
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
## 
## 
## Loading required package: caTools
## 
## Loading required package: corrplot
## 
## corrplot 0.94 loaded
## 
## Loading required package: summarytools
## 
## 
## Attaching package: 'summarytools'
## 
## 
## The following object is masked from 'package:tibble':
## 
##     view
## 
## 
## Loading required package: plotly
## 
## 
## Attaching package: 'plotly'
## 
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## 
## The following object is masked from 'package:graphics':
## 
##     layout

Introduction

The objective of this project is to analyze match statistics from the Spanish La Liga, from the 2009/2010 season through the ongoing 2024/2025 season. The dataset, sourced from http://www.football-data.co.uk/, records the final and half-time results, shots, fouls, corner kicks, and yellow and red cards for each match. It is a valuable resource for understanding the dynamics of matches in one of Europe’s top football leagues.

Data Description

The dataset contains one record per match played in La Liga since the 2009/2010 season. Each record includes the match date, the teams involved, the final and half-time scores, shots, fouls, corner kicks, and yellow and red cards. The data is updated weekly via Travis-CI, which keeps it current for ongoing analysis.

The following table describes the variables recorded for each match:

Label     Description
--------  -------------------------------------------------
Date      Date of the match
HomeTeam  Home Team of the match
AwayTeam  Away Team of the match
FTHG      Full Time Home Team Goals
FTAG      Full Time Away Team Goals
FTR       Full Time Result (H=Home Win, D=Draw, A=Away Win)
HTHG      Half Time Home Team Goals
HTAG      Half Time Away Team Goals
HTR       Half Time Result (H=Home Win, D=Draw, A=Away Win)
HS        Home Team Shots
AS        Away Team Shots
HST       Home Team Shots on Target
AST       Away Team Shots on Target
HF        Home Team Fouls Committed
AF        Away Team Fouls Committed
HC        Home Team Corners
AC        Away Team Corners
HY        Home Team Yellow Cards
AY        Away Team Yellow Cards
HR        Home Team Red Cards
AR        Away Team Red Cards

Analysis description

This analysis is framed around a binary classification question: can Barcelona win the 2024/2025 La Liga title?

Specifically, the analysis aims to:

  • Identify trends and patterns in match outcomes across the seasons covered by the dataset.
  • Explore the impact of factors such as home advantage, team form, and disciplinary actions on match results.
  • Investigate correlations between specific match statistics and overall team performance over the dataset period.
  • Identify potential predictors of match outcomes and assess the predictive power of statistical models.
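
One possible way to turn the question above into a binary target is to flag, for each match a team played, whether that team won. The sketch below is only an illustration in base R: the toy rows are made up, the spelling "Barcelona" is an assumption about how the team appears in the dataset, and a match-level win flag is a proxy rather than a definition of winning the title.

```r
# Toy stand-in for a few rows of football_data (values are made up)
matches <- data.frame(
  HomeTeam = c("Barcelona", "Sevilla", "Real Madrid", "Valencia"),
  AwayTeam = c("Getafe", "Barcelona", "Barcelona", "Betis"),
  FTR      = c("H", "D", "A", "H"),
  stringsAsFactors = FALSE
)

team <- "Barcelona"  # assumed spelling of the team name in the dataset

# Keep only the matches the team played in
team_matches <- matches[matches$HomeTeam == team | matches$AwayTeam == team, ]

# 1 if the team won (home win when at home, away win when away), 0 otherwise
team_matches$win <- as.integer(
  (team_matches$HomeTeam == team & team_matches$FTR == "H") |
    (team_matches$AwayTeam == team & team_matches$FTR == "A")
)

team_matches
```

Draws and losses both map to 0 here; a three-level target (win, draw, loss) would be an equally reasonable alternative.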

Data exploration and cleaning

The CSV file downloaded from the website contains match data for every La Liga season from 2009/2010 up to the 21st of October of the 2024/2025 season, 5,800 matches in total. Each row holds the statistics for one match, including final and half-time scores, team information, and disciplinary actions.

I removed the betting-related statistics and other columns not needed for the analysis, retaining only the match statistics described in the table above.
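
The column filtering described above could look roughly like the following base R sketch. The toy data frame and its values are made up, and `B365H` and `B365A` are used only as examples of betting-odds columns; the exact set of columns dropped from the original download is not stated in this report.

```r
# Toy stand-in for the raw download (values are made up);
# B365H and B365A stand in for betting-odds columns
raw <- data.frame(
  HomeTeam = c("Team A", "Team B"),
  AwayTeam = c("Team C", "Team D"),
  FTHG     = c(1, 0),
  FTAG     = c(0, 0),
  B365H    = c(1.9, 2.1),
  B365A    = c(4.0, 3.5),
  stringsAsFactors = FALSE
)

# Columns to keep; in practice this would be the full list from the table above
keep <- c("HomeTeam", "AwayTeam", "FTHG", "FTAG")

# intersect() avoids an error if a listed column is missing from the file
clean <- raw[, intersect(keep, names(raw)), drop = FALSE]

names(clean)
```

Selecting by an explicit keep-list rather than dropping by pattern makes the retained schema easy to audit.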

# Load the required library
library(readr)

# Read the dataset from the CSV file
football_data <- read_csv("./dataset.csv")
## Rows: 5800 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): Date, HomeTeam, AwayTeam, FTR, HTR
## dbl (16): FTHG, FTAG, HTHG, HTAG, HS, AS, HST, AST, HF, AF, HC, AC, HY, AY, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Preview the first few rows of the dataset
head(football_data)
## # A tibble: 6 × 21
##   Date   HomeTeam AwayTeam  FTHG  FTAG FTR    HTHG  HTAG HTR      HS    AS   HST
##   <chr>  <chr>    <chr>    <dbl> <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 29/08… Real Ma… La Coru…     3     2 H         2     1 H        28     9    11
## 2 29/08… Zaragoza Tenerife     1     0 H         0     0 D        17    16     8
## 3 30/08… Almeria  Vallado…     0     0 D         0     0 D        20     7     5
## 4 30/08… Ath Bil… Espanol      1     0 H         0     0 D        14     8     4
## 5 30/08… Malaga   Ath Mad…     3     0 H         1     0 H         8    16     4
## 6 30/08… Mallorca Xerez        2     0 H         0     0 D        10     7     3
## # ℹ 9 more variables: AST <dbl>, HF <dbl>, AF <dbl>, HC <dbl>, AC <dbl>,
## #   HY <dbl>, AY <dbl>, HR <dbl>, AR <dbl>

To ensure the integrity of our analysis, we need to clean the data by checking for missing values, duplicate entries, and inconsistencies in data types.

# Check for missing values
missing_values <- colSums(is.na(football_data))
missing_values[missing_values > 0]
## named numeric(0)
# Convert necessary columns to appropriate data types
football_data$FTR <- factor(football_data$FTR, levels = c("H", "D", "A"), labels = c("Home Win", "Draw", "Away Win"))

# Summary of the cleaned dataset
summary(football_data)
##      Date             HomeTeam           AwayTeam              FTHG       
##  Length:5800        Length:5800        Length:5800        Min.   : 0.000  
##  Class :character   Class :character   Class :character   1st Qu.: 1.000  
##  Mode  :character   Mode  :character   Mode  :character   Median : 1.000  
##                                                           Mean   : 1.547  
##                                                           3rd Qu.: 2.000  
##                                                           Max.   :10.000  
##       FTAG             FTR            HTHG             HTAG       
##  Min.   :0.000   Home Win:2720   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.000   Draw    :1455   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :1.000   Away Win:1625   Median :0.0000   Median :0.0000  
##  Mean   :1.125                   Mean   :0.6855   Mean   :0.4928  
##  3rd Qu.:2.000                   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :8.000                   Max.   :6.0000   Max.   :5.0000  
##      HTR                  HS              AS             HST        
##  Length:5800        Min.   : 1.00   Min.   : 0.00   Min.   : 0.000  
##  Class :character   1st Qu.:10.00   1st Qu.: 8.00   1st Qu.: 3.000  
##  Mode  :character   Median :13.00   Median :10.00   Median : 4.000  
##                     Mean   :13.61   Mean   :10.74   Mean   : 4.859  
##                     3rd Qu.:17.00   3rd Qu.:13.00   3rd Qu.: 6.000  
##                     Max.   :37.00   Max.   :39.00   Max.   :18.000  
##       AST               HF              AF              HC        
##  Min.   : 0.000   Min.   : 1.00   Min.   : 0.00   Min.   : 0.000  
##  1st Qu.: 2.000   1st Qu.:11.00   1st Qu.:11.00   1st Qu.: 4.000  
##  Median : 3.000   Median :14.00   Median :14.00   Median : 5.000  
##  Mean   : 3.768   Mean   :13.95   Mean   :13.79   Mean   : 5.673  
##  3rd Qu.: 5.000   3rd Qu.:17.00   3rd Qu.:17.00   3rd Qu.: 7.000  
##  Max.   :16.000   Max.   :33.00   Max.   :31.00   Max.   :20.000  
##        AC               HY              AY             HR        
##  Min.   : 0.000   Min.   :0.000   Min.   :0.00   Min.   :0.0000  
##  1st Qu.: 2.000   1st Qu.:1.000   1st Qu.:2.00   1st Qu.:0.0000  
##  Median : 4.000   Median :2.000   Median :3.00   Median :0.0000  
##  Mean   : 4.355   Mean   :2.423   Mean   :2.65   Mean   :0.1228  
##  3rd Qu.: 6.000   3rd Qu.:3.000   3rd Qu.:4.00   3rd Qu.:0.0000  
##  Max.   :17.000   Max.   :9.000   Max.   :9.00   Max.   :3.0000  
##        AR        
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.1516  
##  3rd Qu.:0.0000  
##  Max.   :3.0000
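
The duplicate and data-type checks mentioned at the start of this section could be sketched as follows. This is a base R illustration on made-up rows, not the code used for this report: `parse_match_date` is a hypothetical helper, and it assumes dates are written day/month/year with either a 2-digit or a 4-digit year.

```r
# Toy stand-in for football_data (values are made up)
toy <- data.frame(
  Date     = c("01/09/12", "01/09/12", "15/09/2018"),
  HomeTeam = c("Team A", "Team A", "Team B"),
  AwayTeam = c("Team C", "Team C", "Team D"),
  HTR      = c("H", "H", "D"),
  stringsAsFactors = FALSE
)

# Count fully duplicated rows, then drop them
n_dup <- sum(duplicated(toy))
toy <- toy[!duplicated(toy), ]

# Hypothetical helper: parse day/month/year strings with 2- or 4-digit years
parse_match_date <- function(x) {
  four_digit <- grepl("/[0-9]{4}$", x)
  out <- rep(as.Date(NA), length(x))
  out[four_digit]  <- as.Date(x[four_digit],  format = "%d/%m/%Y")
  out[!four_digit] <- as.Date(x[!four_digit], format = "%d/%m/%y")
  out
}

toy$Date <- parse_match_date(toy$Date)

# Treat the half-time result as a factor, mirroring the FTR conversion
toy$HTR <- factor(toy$HTR, levels = c("H", "D", "A"))

str(toy)
```

Any value that fails to parse comes back as `NA`, so a follow-up `sum(is.na(toy$Date))` would reveal unexpected date formats.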

Univariate Analysis

Distribution of Match Outcomes

library(plotly)
# Plot the distribution of match results
p1 <- ggplot(football_data, aes(x = FTR)) +
  geom_bar(fill = "lightblue", color = "black") + 
  labs(title = "Distribution of Match Results", x = "Match Outcome", y = "Count") +
  theme_minimal() +
  theme(panel.border = element_rect(color = "black", fill = NA, linewidth = 1))
p1_interactive <- ggplotly(p1)

p1_interactive

Goals Scored Distribution

# Histogram of Home Team Goals
p2 <- ggplot(football_data, aes(x = FTHG)) +
  geom_histogram(aes(text = after_stat(count)), bins = 10, fill = "green", alpha = 0.7, color = "black") +
  labs(title = "Distribution of Home Team Goals", x = "Goals", y = "Count") +
  theme_minimal() +
  theme(panel.border = element_rect(color = "black", fill = NA, linewidth = 1))
p2_interactive <- ggplotly(p2, tooltip = "text")

p2_interactive

# Histogram of Away Team Goals
p3 <- ggplot(football_data, aes(x = FTAG)) +
  geom_histogram(aes(text = after_stat(count)), bins = 10, fill = "red", alpha = 0.7, color = "black") +
  labs(title = "Distribution of Away Team Goals", x = "Goals", y = "Count") +
  theme_minimal() +
  theme(panel.border = element_rect(color = "black", fill = NA, linewidth = 1))
p3_interactive <- ggplotly(p3, tooltip = "text")

p3_interactive
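
Because goal counts are whole numbers, an exact frequency table can complement the histograms, whose bin edges do not necessarily line up with integers. The following is a small base R sketch on made-up values, not on the real dataset.

```r
# Made-up goal counts standing in for football_data$FTHG
goals <- c(0, 1, 1, 2, 0, 3, 1, 2, 5, 1)

# Tabulate every value from 0 to the maximum, so unobserved counts show as 0
goal_counts <- table(factor(goals, levels = 0:max(goals)))
goal_counts

# Relative frequency of each goal count
goal_share <- prop.table(goal_counts)
round(goal_share, 2)
```

In ggplot2, `geom_bar()` on the goal column or `geom_histogram(binwidth = 1)` would be the plotting counterparts of this table.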

Bivariate Analysis

Home Advantage

# Analyze home advantage
p_home_advantage <- ggplot(football_data, aes(x = FTR, fill = FTR)) +
  geom_bar(aes(text = after_stat(count)), position = "dodge", color = "black") +
  labs(title = "Home Advantage in Match Outcomes", x = "Match Result", y = "Count") +
  theme_minimal() +
  theme(panel.border = element_rect(color = "black", fill = NA, linewidth = 1))
ggplotly(p_home_advantage, tooltip = "text")
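
To put numbers on the bar chart, the outcome counts reported by `summary(football_data)` earlier (2720 home wins, 1455 draws, 1625 away wins) can be converted into shares. The sketch below also runs an exact binomial test on the decisive matches; that test treats matches as independent, which is a simplifying assumption rather than something established in this report.

```r
# Outcome counts as reported by summary(football_data)
outcome_counts <- c("Home Win" = 2720, "Draw" = 1455, "Away Win" = 1625)

# Share of each outcome across all matches, in percent
outcome_share <- outcome_counts / sum(outcome_counts)
round(100 * outcome_share, 1)

# Ratio of home wins to away wins
home_away_ratio <- outcome_counts[["Home Win"]] / outcome_counts[["Away Win"]]
round(home_away_ratio, 2)

# Among matches with a winner, is the home-win probability different from 0.5?
decisive <- outcome_counts[["Home Win"]] + outcome_counts[["Away Win"]]
home_test <- binom.test(outcome_counts[["Home Win"]], decisive, p = 0.5)
home_test$p.value
```

Home wins account for just under half of all matches, and there are roughly 1.67 home wins for every away win.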

Goals vs. Match Outcome

# Boxplot of Home Goals vs. Match Outcome
p_home_goals <- ggplot(football_data, aes(x = FTR, y = FTHG)) +
  geom_boxplot(aes(text = paste("Home Goals: ", FTHG)), fill = "lightblue", color = "black") +  
  labs(title = "Home Goals vs. Match Outcome", x = "Match Outcome", y = "Home Goals") +
  theme_minimal() +
  theme(panel.border = element_rect(color = "black", fill = NA, linewidth = 1))
## Warning in geom_boxplot(aes(text = paste("Home Goals: ", FTHG)), fill =
## "lightblue", : Ignoring unknown aesthetics: text
ggplotly(p_home_goals, tooltip = "text")
# Boxplot of Away Goals vs. Match Outcome
p_away_goals <- ggplot(football_data, aes(x = FTR, y = FTAG)) +
  geom_boxplot(aes(text = paste("Away Goals: ", FTAG)), fill = "lightgreen", color = "black") +  
  labs(title = "Away Goals vs. Match Outcome", x = "Match Outcome", y = "Away Goals") +
  theme_minimal() +
  theme(panel.border = element_rect(color = "black", fill = NA, linewidth = 1))
## Warning in geom_boxplot(aes(text = paste("Away Goals: ", FTAG)), fill =
## "lightgreen", : Ignoring unknown aesthetics: text
ggplotly(p_away_goals, tooltip = "text")
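
A numeric companion to the boxplots is the mean number of home and away goals per outcome. The sketch below uses base R `aggregate()` on a made-up toy data frame with the same column names as `football_data`; the values are illustrative only.

```r
# Toy stand-in for football_data (values are made up)
toy <- data.frame(
  FTR  = factor(c("Home Win", "Home Win", "Draw", "Away Win", "Away Win"),
                levels = c("Home Win", "Draw", "Away Win")),
  FTHG = c(3, 1, 1, 0, 1),
  FTAG = c(0, 0, 1, 2, 3)
)

# Mean home and away goals for each match outcome
goal_summary <- aggregate(cbind(FTHG, FTAG) ~ FTR, data = toy, FUN = mean)
goal_summary
```

Running the same call with `data = football_data` should give the per-outcome means for the real matches, assuming FTR has already been converted to a factor as shown earlier.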